{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 助教哥你好呀~\n",
"\n",
"### 这个文件是可以**直接运行**的加载模型的代码\n",
"\n",
"-----\n",
"\n",
"*因为中间有一些处理数据的过程, 整个文件运行时间大概在十分钟, 用jupyter notebook打开可以看到我最后一次运行与输出的结果*\n",
"\n",
"##### 助教哥辛苦了"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"E:\\Anaconda3\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"from keras.models import load_model\n",
"\n",
"model = load_model('最高分的训练好的模型.h5')"
]
},
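{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to confirm the network loaded correctly before the longer preprocessing steps, an optional check (a minimal sketch, not part of the original run) is:\n",
"\n",
"```python\n",
"model.summary()  # print the layer-by-layer architecture of the loaded model\n",
"```"
]
},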
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**接下来 读取预处理好的测试数据**"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" text | \n",
" class | \n",
" positive | \n",
"
\n",
" \n",
" | index | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 18年结婚 哈哈哈 | \n",
" 0 | \n",
" 0.900696 | \n",
"
\n",
" \n",
" | 1 | \n",
" 2017最后顿大餐吃完两人世界明年就是三个人一起啦许下生日愿望️希望一家人都能顺利平安健康🏻🏻🏻 | \n",
" 1 | \n",
" 0.999904 | \n",
"
\n",
" \n",
" | 2 | \n",
" 意盎然的季节!祝愿大家都生机勃勃,郁郁葱葱! | \n",
" 2 | \n",
" 0.736431 | \n",
"
\n",
" \n",
" | 3 | \n",
" 2017 遇见挚友 遇见我老公 结了婚有了小芒果 希望2018也超级美好️ | \n",
" 3 | \n",
" 0.983905 | \n",
"
\n",
" \n",
" | 4 | \n",
" 2018.1.1 | \n",
" 4 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | 5 | \n",
" 2018加油! | \n",
" 5 | \n",
" 0.895319 | \n",
"
\n",
" \n",
" | 6 | \n",
" 2018年做一个更加真实的自己。️ | \n",
" 3 | \n",
" 0.783433 | \n",
"
\n",
" \n",
" | 7 | \n",
" 2018年的第一天,完美的错过了一辆公交车。 德州 | \n",
" 6 | \n",
" 0.934181 | \n",
"
\n",
" \n",
" | 8 | \n",
" 2018年目标1.赚钱买房2.谈场恋爱,遇到对的人就结婚3.拥有一副健康的身体4.学会一种乐... | \n",
" 7 | \n",
" 0.999799 | \n",
"
\n",
" \n",
" | 9 | \n",
" 2018年第一个假期:元旦,就这么过去了,感冒咳嗽发高烧给这个元旦带来了不一样的节日,好快呀... | \n",
" 8 | \n",
" 0.733896 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" text class positive\n",
"index \n",
"0 18年结婚 哈哈哈 0 0.900696\n",
"1 2017最后顿大餐吃完两人世界明年就是三个人一起啦许下生日愿望️希望一家人都能顺利平安健康🏻🏻🏻 1 0.999904\n",
"2 意盎然的季节!祝愿大家都生机勃勃,郁郁葱葱! 2 0.736431\n",
"3 2017 遇见挚友 遇见我老公 结了婚有了小芒果 希望2018也超级美好️ 3 0.983905\n",
"4 2018.1.1 4 0.500000\n",
"5 2018加油! 5 0.895319\n",
"6 2018年做一个更加真实的自己。️ 3 0.783433\n",
"7 2018年的第一天,完美的错过了一辆公交车。 德州 6 0.934181\n",
"8 2018年目标1.赚钱买房2.谈场恋爱,遇到对的人就结婚3.拥有一副健康的身体4.学会一种乐... 7 0.999799\n",
"9 2018年第一个假期:元旦,就这么过去了,感冒咳嗽发高烧给这个元旦带来了不一样的节日,好快呀... 8 0.733896"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import jieba\n",
"\n",
"dff = pd.read_csv(\"./Preprocessed_data/train.csv\",index_col=0)\n",
"dff['text'] = dff['text'].fillna('')\n",
"dff.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" text | \n",
" class | \n",
" positive | \n",
"
\n",
" \n",
" | index | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 我是正面哦 | \n",
" 0 | \n",
" 0.347826 | \n",
"
\n",
" \n",
" | 1 | \n",
" 爱是恒久忍耐,又有恩慈。爱是不嫉妒,不自夸,不张狂,不轻易发怒。不计算人的恶。凡事包容。凡事... | \n",
" 0 | \n",
" 0.496333 | \n",
"
\n",
" \n",
" | 2 | \n",
" 讨厌死了,上班上班上班不停的上班我真的超级累。什么都不干还是超级超级累。 | \n",
" 0 | \n",
" 0.000422 | \n",
"
\n",
" \n",
" | 3 | \n",
" 矮马大半夜的放肌肉男不让人睡觉了 | \n",
" 0 | \n",
" 0.409895 | \n",
"
\n",
" \n",
" | 4 | \n",
" 谢谢陈先生。 | \n",
" 0 | \n",
" 0.768959 | \n",
"
\n",
" \n",
" | 5 | \n",
" 我的2016要早点睡别熬夜 | \n",
" 0 | \n",
" 0.625607 | \n",
"
\n",
" \n",
" | 6 | \n",
" 周锐锐哥!爱你 | \n",
" 0 | \n",
" 0.970187 | \n",
"
\n",
" \n",
" | 7 | \n",
" 塞尼亚岛 | \n",
" 0 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | 8 | \n",
" 只可惜没能去现场 | \n",
" 0 | \n",
" 0.100791 | \n",
"
\n",
" \n",
" | 9 | \n",
" 自从发现这个号都处于一种忍不住不看看了睡不着的状态 | \n",
" 0 | \n",
" 0.355194 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" text class positive\n",
"index \n",
"0 我是正面哦 0 0.347826\n",
"1 爱是恒久忍耐,又有恩慈。爱是不嫉妒,不自夸,不张狂,不轻易发怒。不计算人的恶。凡事包容。凡事... 0 0.496333\n",
"2 讨厌死了,上班上班上班不停的上班我真的超级累。什么都不干还是超级超级累。 0 0.000422\n",
"3 矮马大半夜的放肌肉男不让人睡觉了 0 0.409895\n",
"4 谢谢陈先生。 0 0.768959\n",
"5 我的2016要早点睡别熬夜 0 0.625607\n",
"6 周锐锐哥!爱你 0 0.970187\n",
"7 塞尼亚岛 0 0.500000\n",
"8 只可惜没能去现场 0 0.100791\n",
"9 自从发现这个号都处于一种忍不住不看看了睡不着的状态 0 0.355194"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfTest = pd.read_csv(\"./Preprocessed_data/test.csv\",index_col=0)\n",
"dfTest['text'] = dfTest['text'].fillna('')\n",
"dfTest.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**还有一点处理, 很快了**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Loading model from cache C:\\Users\\Kai\\AppData\\Local\\Temp\\jieba.cache\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading model cost 0.810 seconds.\n",
"Prefix dict has been built succesfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"100000\n",
"200000\n",
"300000\n",
"400000\n",
"500000\n",
"600000\n",
"700000\n",
"800000\n",
"0\n",
"100000\n"
]
}
],
"source": [
"def stopwordslist():\n",
" f = open(\"./Preprocessed_data/stop.txt\", \"r\")\n",
" line = f.readline()\n",
" stopwords = []\n",
" index = 0\n",
" while line:\n",
" index += 1\n",
" line = line.replace('\\n', '')\n",
" line = line.replace('[', '')\n",
" line = line.replace(']', '')\n",
" line = line.replace(']', '')\n",
" line = line.replace('[', '')\n",
" \n",
" stopwords.append(line)\n",
" line = f.readline()\n",
"\n",
" return stopwords\n",
"\n",
"stopwords = stopwordslist()\n",
"\n",
"def seg_depart(sentence):\n",
" sentence_depart = jieba.cut(sentence.strip())\n",
" outstr = ''\n",
" for word in sentence_depart:\n",
" if word not in stopwords:\n",
" if word != '\\t':\n",
" outstr += word\n",
" outstr += \" \"\n",
" return outstr\n",
"\n",
"sen = dff['text'].values\n",
"\n",
"for i in range(len(sen)):\n",
" if i % 100000 == 0:\n",
" print(i)\n",
" sen[i] = seg_depart(sen[i])\n",
" \n",
"\n",
"senTest = dfTest['text'].values\n",
"\n",
"for i in range(len(senTest)):\n",
" if i % 100000 == 0:\n",
" print(i)\n",
" senTest[i] = seg_depart(senTest[i])\n",
" \n",
"\n",
"from keras.preprocessing.text import Tokenizer\n",
"from keras.preprocessing.sequence import pad_sequences\n",
"\n",
"MAX_NB_WORDS = 20000\n",
"tokenizer = Tokenizer(nb_words=MAX_NB_WORDS, char_level=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**上面的输出是运行进度的一些信息, 上面的cell大概需要运行五分钟**\n",
"\n",
"*很快就好啦*"
]
},
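{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, `seg_depart` turns one raw sentence into space-separated jieba tokens with the stop words removed. A minimal illustration (the sentence is made up, and the surviving tokens depend on the contents of stop.txt):\n",
"\n",
"```python\n",
"# Hypothetical example, not part of the original run\n",
"print(seg_depart('2018年的第一天,天气很好'))  # prints the space-joined tokens that pass the stop-word filter\n",
"```"
]
},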
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"tokenizer.fit_on_texts(sen)\n",
"sequences_test = tokenizer.texts_to_sequences(senTest)\n",
"MAX_SEQUENCE_LENGTH = 300\n",
"\n",
"x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)"
]
},
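{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tokenizer is fitted on the training texts so that the test set is mapped with the same word indices the model saw during training. A self-contained sketch of this fit/transform pattern on toy strings (not the real data):\n",
"\n",
"```python\n",
"from keras.preprocessing.text import Tokenizer\n",
"from keras.preprocessing.sequence import pad_sequences\n",
"\n",
"toy = Tokenizer(num_words=10)\n",
"toy.fit_on_texts(['new year happy', 'happy day'])  # build the word index on 'training' texts\n",
"seqs = toy.texts_to_sequences(['happy new day'])   # reuse that index on 'test' texts\n",
"print(pad_sequences(seqs, maxlen=5))               # left-pad every sequence to length 5\n",
"```"
]
},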
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"50000\n",
"100000\n",
"150000\n"
]
}
],
"source": [
"import numpy as np\n",
"import csv\n",
"\n",
"pred = model.predict(x_test)\n",
"result = np.argmax(pred, axis = 1)\n",
"\n",
"# 写入文件\n",
"csvFile = open('FORCheckResult.csv','w', newline='', encoding='UTF-8') # 设置newline,否则两行之间会空一行\n",
"writer = csv.writer(csvFile)\n",
"\n",
"writer.writerow(['ID', 'Expected'])\n",
"for i in range(len(result)):\n",
" if i % 50000 == 0:\n",
" print(i)\n",
" writer.writerow([int(i), int(result[i])])\n",
" \n",
"csvFile.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 最高分的训练好的模型.h5 预测的 test.data 已经被输出到当前文件夹下的 FORCheckResult.csv 啦\n",
"\n",
"\n",
"\n",
"#### 辛苦了 {心}"
]
},
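{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (a minimal sketch, not part of the original run), the submission file can be read back to confirm its format:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"pd.read_csv('FORCheckResult.csv').head()  # expect an ID column and an Expected column\n",
"```"
]
},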
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}